Chapter 4: Dimensionality reduction techniques

1 Data wrangling

1.1 Mutate the data

“Mutate the data: transform the Gross National Income (GNI) variable to numeric (using string manipulation). Note that the mutation of ‘human’ was NOT done in the Exercise Set. (1 point)”

1.1.1 read the data-set

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.6     ✔ dplyr   1.0.8
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
human <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human1.txt", 
                    sep =",", header = T)

1.1.2 label the variables

library(finalfit)
library(labelled)
## 
## Attaching package: 'labelled'
## The following object is masked from 'package:finalfit':
## 
##     remove_labels
hd <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human_development.csv")
## Rows: 195 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (6): HDI Rank, Human Development Index (HDI), Life Expectancy at Birth, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gii <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/gender_inequality.csv", na = "..")
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 195 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): GII Rank, Gender Inequality Index (GII), Maternal Mortality Ratio, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(hd);names(gii)
## [1] "HDI Rank"                              
## [2] "Country"                               
## [3] "Human Development Index (HDI)"         
## [4] "Life Expectancy at Birth"              
## [5] "Expected Years of Education"           
## [6] "Mean Years of Education"               
## [7] "Gross National Income (GNI) per Capita"
## [8] "GNI per Capita Rank Minus HDI Rank"
##  [1] "GII Rank"                                    
##  [2] "Country"                                     
##  [3] "Gender Inequality Index (GII)"               
##  [4] "Maternal Mortality Ratio"                    
##  [5] "Adolescent Birth Rate"                       
##  [6] "Percent Representation in Parliament"        
##  [7] "Population with Secondary Education (Female)"
##  [8] "Population with Secondary Education (Male)"  
##  [9] "Labour Force Participation Rate (Female)"    
## [10] "Labour Force Participation Rate (Male)"
hu.names <- rbind(data.frame(name = names(hd)), 
                  data.frame(name =names(gii)),
                  data.frame(name = 
                               c("ratio of Female and Male populations with secondary education", 
                                 "ratio of labor force participation of females and males")))
hu.names <- hu.names[-10,1] #the 10th row is "Country", which appears a second time

for(i in 1:19){ #19 labels: 8 from hd, 10 from gii minus the duplicate Country, plus 2 ratios
  var_label(human[i]) <- hu.names[i] #single-bracket subsetting keeps the data-frame class
}
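Under the hood, labelled::var_label() is essentially sugar for storing a "label" attribute on each column. A toy sketch of the base-R equivalent (the data frame here is hypothetical, not the assignment data):

```r
# Toy illustration: what var_label<- does, roughly, is set a "label" attribute
df <- data.frame(x = 1:3)             # hypothetical one-column data frame
attr(df$x, "label") <- "a toy label"  # attach the label to the column
attr(df$x, "label")                   # retrieve the label again
```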

library(DT)
codebook <- rbind(data.frame(ff_glimpse(human)$Con[1]), data.frame(ff_glimpse(human)$Categorical[1]))
codebook$variable <- rownames(codebook)
codebook %>% datatable

1.1.3 mutate

library(stringr)
str(human$GNI)
##  chr [1:195] "64,992" "42,261" "56,431" "44,025" "45,435" "43,919" "39,568" ...
##  - attr(*, "label")= chr "Gross National Income (GNI) per Capita"
human$GNI <- str_replace(human$GNI, pattern = ",", replace = "") %>% 
  as.numeric
str(human$GNI)
##  num [1:195] 64992 42261 56431 44025 45435 ...
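One caveat worth noting: str_replace() removes only the first comma, so a value with two grouping marks (a hypothetical GNI of a million or more) would fail to parse; str_replace_all() handles any number of them. A small sketch on toy values (not the real GNI column):

```r
library(stringr)

x <- c("64,992", "1,234,567")            # toy values; the second has two commas
as.numeric(str_replace(x, ",", ""))      # "1,234,567" keeps its second comma and becomes NA
as.numeric(str_replace_all(x, ",", ""))  # both values parse cleanly
```

For this data set the values stay below one million, so str_replace() happens to be safe, but str_replace_all() costs nothing extra.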

1.2 Exclude unneeded variables

“Exclude unneeded variables: keep only the columns matching the following variable names (described in the meta file above):”Country”, “Edu2.FM”, “Labo.FM”, “Edu.Exp”, “Life.Exp”, “GNI”, “Mat.Mor”, “Ado.Birth”, “Parli.F” (1 point)”

keep.var <- c("Country", "Edu2.FM", "Labo.FM", "Edu.Exp", "Life.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F")

codebook <- codebook %>% filter(rownames(codebook) %in% c("Edu2.FM", "Labo.FM", "Edu.Exp", "Life.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F"))

human <- human %>% dplyr::select(one_of(keep.var))

human %>% names()
## [1] "Country"   "Edu2.FM"   "Labo.FM"   "Edu.Exp"   "Life.Exp"  "GNI"      
## [7] "Mat.Mor"   "Ado.Birth" "Parli.F"

1.3 Remove rows

“Remove all rows with missing values (1 point)”

human.all <- human %>% filter(complete.cases(human))
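complete.cases() returns TRUE for exactly those rows that contain no missing values; a minimal toy illustration (not the assignment data):

```r
df <- data.frame(a = c(1, NA, 3),  # toy data with missing values
                 b = c(4, 5, NA))
complete.cases(df)        # TRUE FALSE FALSE
df[complete.cases(df), ]  # keeps only the first row
```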

1.4 Remove observations

“Remove the observations which relate to regions instead of countries. (1 point)”

last <- nrow(human.all) - 7
# the last 7 observations are regions, not countries; keep everything before them
human.all <- human.all[1:last, ]
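Dropping the last seven rows relies on the regions sitting at the end of the file. A name-based filter would be more robust; a sketch on toy data (the region labels below are hypothetical, not taken from the actual data set):

```r
df <- data.frame(Country = c("Albania", "Niger", "Arab States", "World"),
                 x = 1:4)                           # toy stand-in for the data
regions <- c("Arab States", "World")                # hypothetical region labels
df.countries <- df[!(df$Country %in% regions), ]    # keep real countries only
nrow(df.countries)                                  # 2 rows remain
```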

1.5 Define row names

“Define the row names of the data by the country names and remove the country name column from the data. The data should now have 155 observations and 8 variables. Save the human data in your data folder including the row names. You can overwrite your old ‘human’ data. (1 point)”

#row names
rownames(human.all) <- human.all$Country
human.all$Country <- NULL
dim(human.all)
## [1] 155   8

2 Analysis

2.1 Requirement 1

“Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)”

2.1.1 Show a graphical overview of the data

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
#define a function that gives me more control over ggpairs
#this function produces a scatter plot with a fitted line
my.fun.smooth <- function(data,    #my function needs 3 arguments
                          mapping,
                          method = "lm"){
  ggplot(data = data, #data is passed from ggpairs' arguments
         mapping = mapping)+#aes is passed from ggpairs' arguments
           geom_point(size = 0.3,  #draw points
                      color = "blue")+
           geom_smooth(method = method,  #fit a linear regression
                       size = 0.3, 
                       color = "red")+
           theme(panel.grid.major = element_blank(), #get rid of the grids
                 panel.grid.minor = element_blank(),
                 panel.background = element_rect(fill = "#F0E442", #adjust background
                                                 color = "black"))
} 

#define a function that gives me more control over ggpairs
#this function produces a density plot
my.fun.density <- function(data, mapping, ...) { #comments are roughly the same as above

    ggplot(data = data, mapping = mapping) +
       geom_histogram(aes(y=..density..),
                      color = "black", 
                      fill = "white")+
       geom_density(fill = "#FF6666", alpha = 0.25) +
       theme(panel.grid.major = element_blank(), 
             panel.grid.minor = element_blank(),
             panel.background = element_rect(fill = "#9999CC",
                                             color = "black"))
} 




ggpairs(human.all, #data
        lower = 
          list(continuous = my.fun.smooth), #lower half show points with fitted line
        diag =
          list(continuous = my.fun.density), #diagonal grids show density plots
        title = "Fig. 2.1.1 Relationships between variables") + #title
  theme (plot.title = element_text(size = 22,  #adjust title visuals
                                   face = "bold")) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `geom_smooth()` using formula 'y ~ x'

2.1.2 Show summaries of the variables in the data

2.1.2.1 summarising the descriptive statistics

library(finalfit)
ff_glimpse(human.all)$Con %>% 
  datatable (caption = "Tab. 2.1.2.1 Descriptive statistics for variables")

2.1.2.2 Summarising the distributions

The distributions of the variables appear in the diagonal panels of Fig. 2.1.1 (see above). For clearer presentation, each distribution is plotted again on its own in Fig. 2.1.2.1 below.

#store clearer names in the labels so that they read better in the plot
label.name <- paste(codebook$variable, paste("\n(",codebook$label,")"))
names(label.name) <- c("Life.Exp",
                  "Edu.Exp",
                  "Mat.Mor",
                  "Ado.Birth",
                  "Parli.F",
                  "Edu2.FM",
                  "Labo.FM", 
                  "GNI")

#plot it
human.all %>% 
  pivot_longer(everything()) %>%  #longer format
  ggplot(aes(x = value)) + #x axis used variable "value" (a default of pivot)
  geom_histogram(aes(y = ..density..), #match ys of density and histogram plots
                 color = "black", #my favorite border color
                 fill = "#9999CC")+  # I heard this is a beautiful color
  geom_density(fill = "pink", 
               alpha = 0.25)+ #adjust the aesthetics for density plot
  facet_wrap(~name, scales = "free", #wrap by name variable
             labeller = labeller(name = label.name)) + #use the label I set above
  theme(panel.grid.major = element_blank(), #get rid of the ugly grids
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "white",#adjust the background
                                        color = "black"),
        strip.background = element_rect(color = "black",#adjust the strips aes
                                        fill = "steelblue"),
        strip.text = element_text(size =12, 
                                  color = "white"), #adjust the strip text
        axis.title.x = element_text(size = 20), #adjust the x text
        axis.title.y = element_text(size = 20), # adjust the y text
        plot.title = element_text(size = 22, face = "bold"))+ #adjust the title
  labs(title = "Fig. 2.1.2.1 Distribution of variables", #title it
       )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.1.3 Describe and interpret the outputs

According to the distribution shapes visualized in Fig. 2.1.2.1 (see above), non-normally distributed variables are reported as median (Q1-Q3); roughly normal variables are reported as mean±sd.

The countries have a life expectancy at birth (Life.Exp) of 74.2 (66.3-77.2) years; expected years of education (Edu.Exp) of 13.2±2.8 years; a maternal mortality ratio (Mat.Mor) of 49 (11.5-190) per 100,000 births; an adolescent birth rate (Ado.Birth) of 33.6 (12.6-72.0) per 1,000 women; a percent representation in parliament (Parli.F) of 20.9±11.5%; a ratio of female to male populations with secondary education (Edu2.FM) of 0.9 (0.7-1.0); a ratio of female to male labor force participation (Labo.FM) of 0.8 (0.6-0.9); and a Gross National Income per capita (GNI) of $12,040.0 (4,197.5-24,512.0).

2.1.4 Commenting on the distributions of the variables and the relationships between them.

2.1.4.1 Distribution

From the figure above, I found: a. The adolescent birth rate is right-skewed with a mode around 25. Most countries have roughly 25 births per 1,000 women, while a small number reach rates as high as 200 births per 1,000 women. Values below zero are impossible for this indicator, which is likely why the distribution is right-skewed.

b. Expected years of education is roughly normally distributed, with the data points centering around 13 years. This makes sense, since this is roughly the time needed to complete pre-college education in most countries.

c. The ratio of female to male populations with secondary education is moderately left-skewed with a mode of 0.9. Both the mode below one and the long left tail show that gender inequality is still present across the world. However, roughly half of the data points fall between 0.9 and 1.1, which means that in half of the countries the gap is no longer that large.

d. Like most income-related indicators, GNI is right-skewed. This makes sense: no one earns a negative income, while a few countries can be extremely rich. The values center around 10,000 dollars.

e. The ratio of female to male labor force participation has a distribution roughly similar to that of the secondary-education ratio. Since these are different indicators capturing a similar idea (gender inequality), this consistency is plausible and a sign of their reliability.

f. Life expectancy at birth centers around 75 years, with a long left tail and a shorter right tail that stops abruptly near 100 years. The distribution also has negative kurtosis. This fits the fact that, thanks to modern medicine, people are living longer, but there is always a limit to human longevity.

g. The maternal mortality ratio is concentrated at small values around 30 per 100,000 births, with positive kurtosis and a long right tail. This reflects the fact that maternal healthcare has long been a mature social service in many countries (hence the concentration at small values and the positive kurtosis), while some severely under-developed countries still cannot provide it properly (hence the long right tail). Note the small secondary peak around 450 per 100,000 births, which might reflect nature's baseline mortality rate.

h. Percent representation in parliament is roughly normally distributed with slight right skewness. I am not familiar enough with how parliaments work to comment on this indicator in depth.

2.1.4.2 Relationships

Correlations between variables are visualized by the scatter plots in Fig. 2.1.1. For better readability, the correlation matrix is also printed below.

library(corrplot)
## corrplot 0.92 loaded
cor(human.all) %>% 
  round(digits = 2) %>% 
  datatable(caption = 
              "Tab. 2.1.4.2 Correlation coefficients between the variables")

It can be observed that six of the eight variables have a correlation coefficient >0.4 with at least one other variable. The remaining two are Labo.FM and Parli.F. Their weaker correlations with the other variables might be due to a number of pronounced confounders. For example, women's labor force participation is culturally less welcomed in some Asian cultures for reasons other than gender inequality; it is also influenced by a country's industrial structure (countries with a developed service sector may employ more women); and only 46% of the world population lives under a full or flawed democracy, while a parliament does not even exist in a number of countries.

2.2 Requirement 2

Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables.

2.2.1 Perform principal component analysis (PCA) on the raw (non-standardized) human data

PCA is performed on the un-standardized data set.

# perform principal component analysis with 
# results passing into pca.human
pca.human <- prcomp(human.all) #PCA

2.2.2 Show the variability captured by the principal components

2.2.2.1 show variability numerically

First, I checked the names of the lists of the pca results.

names(pca.human) #get the names of data set
## [1] "sdev"     "rotation" "center"   "scale"    "x"

By reading the help file, I learned that the element “sdev” holds the square roots of the eigenvalues of the components, i.e. the eigenvalue of component i is ev_i = sdev_i^2. The variability vc captured by component i can then be formulated as:

\[ vc_{i} =\frac{ev_{i}}{ \sum_{j=1}^{n}ev_{j}} \]

The variability captured by each component is hence calculated.

eig <- pca.human$sdev^2
for (i in 1:8){ # the loop runs through all 8 components
  a <- eig[i]/sum(eig) #use the formula above
  b <- paste("PC", i , "captures", a*100, "% of variability") #improve readability
  print(b) #print it out
}
## [1] "PC 1 captures 99.9897649093306 % of variability"
## [1] "PC 2 captures 0.0100076294642637 % of variability"
## [1] "PC 3 captures 0.000184456576000702 % of variability"
## [1] "PC 4 captures 3.81492929408603e-05 % of variability"
## [1] "PC 5 captures 4.1243674737075e-06 % of variability"
## [1] "PC 6 captures 7.12977441293848e-07 % of variability"
## [1] "PC 7 captures 1.06301744648587e-08 % of variability"
## [1] "PC 8 captures 7.36109912940859e-09 % of variability"

It is found that PC1 alone explains 99.99% of the variability of the data set; the remaining components together contribute less than 0.1%.
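As a sanity check on the formula above, sdev^2 from prcomp should match the eigenvalues of the covariance matrix exactly, and the computed shares should sum to one. A quick sketch on toy data (not the assignment data):

```r
set.seed(1)
m <- matrix(rnorm(40), ncol = 4)   # toy data: 10 observations, 4 variables
p <- prcomp(m)                     # un-standardized PCA, as above
lambda <- p$sdev^2                 # proposed eigenvalues
stopifnot(all.equal(lambda, eigen(cov(m))$values))  # same values, same decreasing order
stopifnot(all.equal(sum(lambda / sum(lambda)), 1))  # variability shares sum to 1
```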

The factoextra package (a companion to FactoMineR) calculates the captured variability automatically. I will use it to check whether my calculation replicates its results.

library(FactoMineR)#for Multivariate Exploratory Data Analysis and Data Mining
library(factoextra)#to extract and visualize the output of PCA, CA and MCA
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
eig.human <- get_eigenvalue(pca.human)
eig.human[2]
##       variance.percent
## Dim.1     9.998976e+01
## Dim.2     1.000763e-02
## Dim.3     1.844566e-04
## Dim.4     3.814929e-05
## Dim.5     4.124367e-06
## Dim.6     7.129774e-07
## Dim.7     1.063017e-08
## Dim.8     7.361099e-09

They are the same: PC1 explains 99.99% of the variability of the data set.

2.2.2.2 show variability visually

#explore the variability visually
library(factoextra)#install.packages("factoextra")
fviz_eig(pca.human,
         barfill = "grey",
         barcolor = "black")+
  theme(panel.grid = element_blank()) +
  labs(title = "Fig. 2.2.2.2 Scree plot for eigenvalues across the components (un-standardized dataset)")

2.2.2.3 check the loading score numerically

Although this is not required by the assignment, it is also interesting to check each variable's contribution to each component (the loading scores). This sheds light on the comparison between PCA on standardized and un-standardized data. According to the help file of prcomp, the loadings are stored in the "rotation" element of the PCA result.

#obtain loading scores of my PCA result and pass it into pca.human.ls
pca.human.ls <- pca.human$rotation%>% data.frame
#check the loading scores
pca.human.ls
##                     PC1           PC2           PC3           PC4           PC5
## Edu2.FM   -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM    2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03  0.0022190154
## Edu.Exp   -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02  0.1431180282
## Life.Exp  -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02  0.9865644425
## GNI       -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05 -0.0001135863
## Mat.Mor    5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03  0.0266373214
## Ado.Birth  1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03  0.0188618600
## Parli.F   -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
##                     PC6           PC7           PC8
## Edu2.FM    2.180183e-02  6.998623e-01  7.139410e-01
## Labo.FM    3.264423e-02  7.132267e-01 -7.001533e-01
## Edu.Exp    9.882477e-01 -3.826887e-02  7.776451e-03
## Life.Exp  -1.453515e-01  5.380452e-03  2.281723e-03
## GNI       -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor    1.695203e-03  1.355518e-04  8.371934e-04
## Ado.Birth  1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F   -2.309896e-02 -2.642548e-03  2.680113e-03
#There are quite a number of extremely small numbers in each column. To get
#a better reading experience, I passed the largest score in each column to an
#object "largest.ls" and extracted their row names(the variable name), merging
#into a dataframe
largest.ls <- apply(pca.human.ls, 2, function(x) max(abs(x))) # largest absolute scores
name <- rownames(pca.human.ls)[apply(pca.human.ls, 2, function(x) which.max(abs(x)))] # matching row names
data.frame(largest.ls = largest.ls, name = name) %>% 
  datatable (caption = "Tab. 2.2.2.3 Largest factor loading for each component") %>% # data frame
  formatRound(columns = 1, digits = 2) #round to increase readability

It is interesting to find that each variable loads almost exclusively on one component (six of the eight variables have a loading score larger than 0.98 on one unique component).
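A small pitfall worth noting when extracting these names: pairing max(abs(x)) with which.max(x) picks the wrong variable whenever the largest-magnitude loading is negative (as it is for GNI on PC1); which.max(abs(x)) keeps the value and the name consistent. A toy sketch (the loadings below are made up for illustration):

```r
x <- c(GNI = -0.99, Mat.Mor = 0.05)  # toy loadings for one component
names(x)[which.max(x)]               # "Mat.Mor" -- largest signed value, wrong variable
names(x)[which.max(abs(x))]          # "GNI" -- largest magnitude, matches max(abs(x))
```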

2.2.3 Draw a biplot

Quoted from the requirement: “Draw a biplot displaying the observations by the first two principal components, along with arrows representing the original variables.”

# draw a biplot of the principal component representation and the original variables
biplot(pca.human)#biplot
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
Fig. 2.2.3 Biplot for the first two components (unstandardized data set)

In the biplot, red text marks the variables, black text the rows (countries). GNI lies far from the origin (0,0) along the x axis (PC1), indicating its strong contribution to PC1. Most countries cluster tightly around the origin, which shows that they are not well represented on the factor map.

I visualized another biplot using an external package. The graph has a different visual style but conveys the same idea.

p1 <- fviz_pca_biplot(pca.human, #data used
                      repel = TRUE,  #avoid overlap
                      ggtheme = theme_test(), #a theme I like
                      title = "Fig. 2.2.3b A PCA biplot for the unstandardized dataset")
p1
## Warning: ggrepel: 155 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

The graph conveys the same idea as the one generated by the base function.

2.3 Requirement 3

Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to.

2.3.1 Standardize the variables

human.all.scaled <- scale(human.all) #scale it!

2.3.2 Repeat the above analysis

2.3.2.1 PCA

spca.human <- prcomp(human.all.scaled) #PCA
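For the record, pre-scaling with scale() and passing scale. = TRUE to prcomp() give the same result; a quick check on toy data (not the assignment data):

```r
set.seed(2)
x <- matrix(rnorm(30), ncol = 3)        # toy data
p1 <- prcomp(scale(x))                  # scale first, then PCA
p2 <- prcomp(x, scale. = TRUE)          # let prcomp do the scaling
stopifnot(all.equal(p1$sdev, p2$sdev))  # identical standard deviations
```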

2.3.2.2 Check variability captured by each component

eig <- spca.human$sdev^2 #sdev holds the square roots of the eigenvalues; square it
for (i in 1:8){ #this loop goes through all 8 components
  a <- eig[i]/sum(eig) #use the formula I expressed above
  b <- paste("PC", i , "captures", a*100, "% of variability") #make it readable
  print(b) #print it out
}
## [1] "PC 1 captures 53.6046258273805 % of variability"
## [1] "PC 2 captures 16.237031039413 % of variability"
## [1] "PC 3 captures 9.57137445323498 % of variability"
## [1] "PC 4 captures 7.58284503035719 % of variability"
## [1] "PC 5 captures 5.47732753268388 % of variability"
## [1] "PC 6 captures 3.59530248187163 % of variability"
## [1] "PC 7 captures 2.63350570981446 % of variability"
## [1] "PC 8 captures 1.29798792524435 % of variability"
fviz_screeplot(spca.human, 
               barfill = "grey",
               barcolor = "black",
               addlabels = TRUE)+
  theme(panel.grid = element_blank())+
  labs(title = "Fig. 2.3.2.2 Scree plot for eigenvalues across each component \n(standardized dataset)")

An interesting change happens: thanks to scaling, the “scree” doesn’t plummet this time. Comments are given in section 2.3.3.

2.3.2.3 Check factor loadings of each variable on each component

#obtain loading scores of my PCA result and pass it into pca.human.ls
spca.human.ls <- spca.human$rotation%>% data.frame
#check the loading scores
spca.human.ls
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## Labo.FM    0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## Edu.Exp   -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## Life.Exp  -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## GNI       -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor    0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth  0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## Parli.F   -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                   PC6         PC7         PC8
## Edu2.FM    0.17713316  0.05773644  0.16459453
## Labo.FM   -0.03500707 -0.22729927 -0.07304568
## Edu.Exp   -0.38606919  0.77962966 -0.05415984
## Life.Exp  -0.42242796 -0.43406432  0.62737008
## GNI        0.11120305 -0.13711838 -0.16961173
## Mat.Mor    0.17370039  0.35380306  0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F    0.13749772  0.00568387 -0.02306476
#There are quite a number of extremely small numbers in each column. To get
#a better reading experience, I passed the largest score in each column to an
#object "largest.ls" and extracted their row names(the variable name), merging
#into a dataframe
largest.ls <- apply(spca.human.ls, 2, function(x) max(abs(x))) # largest absolute scores
name <- rownames(spca.human.ls)[apply(spca.human.ls, 2, function(x) which.max(abs(x)))] # matching row names
data.frame(largest.ls = largest.ls, name = name) %>% datatable %>%  # data frame
  formatRound(columns = 1, digits = 2) #round to increase readability

An interesting change happens: the one-variable-to-one-component loading pattern has disappeared. Comments are given in section 2.3.3.

2.3.2.4 Biplot

biplot(spca.human) #biplot
Fig 2.3.2.3 Biplot for the first two components (standardized data set)

This time the labels for both the rows and the columns are scattered away from each other, and more variable labels (in red) are visible.

I visualized another biplot using an external package. The graph has a different visual style but conveys the same idea.

p2 <- fviz_pca_biplot(spca.human,  #data used
                      repel = TRUE,  #avoid overlap
                      ggtheme = theme_test(), #a theme I like
                      title = "Fig. 2.3.2b A PCA biplot for the standardized dataset")
p2
## Warning: ggrepel: 88 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

2.3.3 Interpret the results of both analyses

Some interesting findings from comparing the two analyses:

2.3.3.1 Difference between PCA results with standardized and un-standardized data sets

a. Variability explained

With the un-standardized data set, PC1 explains 99.99% of the variability and the other components together contribute less than 0.1%. With the standardized data set, PC1 and PC2 together explain 69.8% of the variability, with the explained share falling gradually over the following components.

b. Factor loading

With the un-standardized data set, each variable loads almost exclusively on one component (six of the eight variables have a loading score larger than 0.98 on one unique component). This one-variable-to-one-component pattern disappears in the results from the standardized data set.

c. Biplot

With the un-standardized data set, most countries cluster tightly around the origin (0,0), showing that they are not well represented on the factor map. Moreover, only one variable, GNI, lies far from the origin along the x axis (PC1), indicating its strong contribution to PC1. With the standardized data set, the row and column points are scattered more evenly across the plane and all the variables are visualized more reasonably.

2.3.3.2 Possible explanation for the difference

Based on the findings above, it is easy to conclude that PCA on the standardized data set produces results better suited for analysis.

A likely explanation is that the different scales of the variables make the features hard to compare. PCA is computed from the covariance matrix. Unlike correlation, which is dimensionless, covariance carries the product of the units of the two variables. When the data are not scaled, the variables are therefore not directly comparable (each has its own value range). Moreover, each variable loads almost exclusively on one component because it can hardly find another variable with a comparable value range. This interpretation is supported by the fact that the only two variables with smaller loading scores, Edu2.FM and Labo.FM, happen to share a similar value range from 0 to 1.
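The dominance of a large-scale variable is easy to reproduce on toy data: give one of two independent variables a thousand-fold larger standard deviation, and PC1 of the unscaled PCA captures essentially all the variance, while the scaled PCA splits it roughly evenly. A sketch (hypothetical data, not the human data set):

```r
set.seed(3)
toy <- data.frame(small = rnorm(200),             # sd = 1
                  big   = rnorm(200, sd = 1000))  # sd = 1000
share1 <- function(p) p$sdev[1]^2 / sum(p$sdev^2) # variability share of PC1
share1(prcomp(toy))                 # close to 1: "big" dominates the raw PCA
share1(prcomp(toy, scale. = TRUE))  # close to 0.5 after standardizing
```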

Covariance also gives some variables extremely high leverage in the data set. To illustrate, I reproduce the table of variable descriptions from section 2.1.2.1:

ff_glimpse(human.all)$Con %>%  #"$Con" calls for the continuous variables
  datatable (caption = "Tab. 2.1.2.1 Descriptive statistics for variables")

The table (see the “mean” column) shows that GNI has a scale vastly larger than the other variables. This leads to large covariances with every other variable, and hence to its over-contribution to the factor solution.

This misrepresentation of the data causes the poor quality of the contributions: the biplot shows most countries clustered tightly together, indicating that the PCA has not produced a factor map with acceptable dissimilarity between rows. The over-contribution of GNI also yields a graph in which only the GNI arrow is visible.

2.4 Requirement 4

Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data

To improve readability, I will visualize the biplot 2.2.1 again and interpret it.

p2 # print picture p2
## Warning: ggrepel: 88 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

The scattered row labels show that the countries are well represented by the factor map. For example, Rwanda’s profile is better represented by PC2 (Dim2; it lies far from the origin along the y axis), while Niger is better represented by PC1 (Dim1; far from the origin along the x axis). Similarly, the variables Labo.FM and Parli.F contribute strongly to the positive side of PC2 (Dim2), while Mat.Mor and Ado.Birth contribute a good amount to the positive side of PC1 (Dim1). The education-related variables, GNI and Life.Exp contribute more to the negative side of PC1 (Dim1).
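The code that created p2 is not shown in this excerpt; as a stand-in, a biplot of the same kind can be drawn with base R alone on the built-in USArrests data (the original presumably used ggplot2/ggrepel, which is an assumption):

```r
# Base-R sketch of a PCA biplot like p2 (USArrests as a stand-in data set):
pca <- prcomp(USArrests, scale. = TRUE)  # standardize, then PCA
biplot(pca, cex = 0.6)                   # rows as labelled points,
                                         # variables as arrows from the origin
```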

2.5 Assignment 5

Quoted from the assignment:

“Load the tea dataset from the text file https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv where the column separator is a comma and the first row includes the column names).

Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents. As you see, all variables are categorical. Convert them explicitly to factors, for example: tea$sugar <- factor(tea$sugar) and then visualize the data.

Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)”

2.5.1 Load the tea dataset

tea <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", 
                    sep =",", header = T)
tea.full <- tea #this is not actually functional, just keep the original data set
                # in case I mess up anything.

2.5.2 Explore the data briefly

The data concern a questionnaire on tea: 300 participants were asked how they drink tea (18 questions), about their perceptions of the product (12 questions), and for some personal details (4 questions).

2.5.2.1 Structure

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : chr  "breakfast" "breakfast" "Not.breakfast" "Not.breakfast" ...
##  $ tea.time        : chr  "Not.tea time" "Not.tea time" "tea time" "Not.tea time" ...
##  $ evening         : chr  "Not.evening" "Not.evening" "evening" "Not.evening" ...
##  $ lunch           : chr  "Not.lunch" "Not.lunch" "Not.lunch" "Not.lunch" ...
##  $ dinner          : chr  "Not.dinner" "Not.dinner" "dinner" "dinner" ...
##  $ always          : chr  "Not.always" "Not.always" "Not.always" "Not.always" ...
##  $ home            : chr  "home" "home" "home" "home" ...
##  $ work            : chr  "Not.work" "Not.work" "work" "Not.work" ...
##  $ tearoom         : chr  "Not.tearoom" "Not.tearoom" "Not.tearoom" "Not.tearoom" ...
##  $ friends         : chr  "Not.friends" "Not.friends" "friends" "Not.friends" ...
##  $ resto           : chr  "Not.resto" "Not.resto" "resto" "Not.resto" ...
##  $ pub             : chr  "Not.pub" "Not.pub" "Not.pub" "Not.pub" ...
##  $ Tea             : chr  "black" "black" "Earl Grey" "Earl Grey" ...
##  $ How             : chr  "alone" "milk" "alone" "alone" ...
##  $ sugar           : chr  "sugar" "No.sugar" "No.sugar" "sugar" ...
##  $ how             : chr  "tea bag" "tea bag" "tea bag" "tea bag" ...
##  $ where           : chr  "chain store" "chain store" "chain store" "chain store" ...
##  $ price           : chr  "p_unknown" "p_variable" "p_variable" "p_variable" ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : chr  "M" "F" "F" "M" ...
##  $ SPC             : chr  "middle" "middle" "other worker" "student" ...
##  $ Sport           : chr  "sportsman" "sportsman" "sportsman" "Not.sportsman" ...
##  $ age_Q           : chr  "35-44" "45-59" "45-59" "15-24" ...
##  $ frequency       : chr  "1/day" "1/day" "+2/day" "1/day" ...
##  $ escape.exoticism: chr  "Not.escape-exoticism" "escape-exoticism" "Not.escape-exoticism" "escape-exoticism" ...
##  $ spirituality    : chr  "Not.spirituality" "Not.spirituality" "Not.spirituality" "spirituality" ...
##  $ healthy         : chr  "healthy" "healthy" "healthy" "healthy" ...
##  $ diuretic        : chr  "Not.diuretic" "diuretic" "diuretic" "Not.diuretic" ...
##  $ friendliness    : chr  "Not.friendliness" "Not.friendliness" "friendliness" "Not.friendliness" ...
##  $ iron.absorption : chr  "Not.iron absorption" "Not.iron absorption" "Not.iron absorption" "Not.iron absorption" ...
##  $ feminine        : chr  "Not.feminine" "Not.feminine" "Not.feminine" "Not.feminine" ...
##  $ sophisticated   : chr  "Not.sophisticated" "Not.sophisticated" "Not.sophisticated" "sophisticated" ...
##  $ slimming        : chr  "No.slimming" "No.slimming" "No.slimming" "No.slimming" ...
##  $ exciting        : chr  "No.exciting" "exciting" "No.exciting" "No.exciting" ...
##  $ relaxing        : chr  "No.relaxing" "No.relaxing" "relaxing" "relaxing" ...
##  $ effect.on.health: chr  "No.effect on health" "No.effect on health" "No.effect on health" "No.effect on health" ...

All the variables are of character type, except for age, which is an integer.

2.5.2.2 Dimensions

dim(tea)
## [1] 300  36

The data set records 300 observations on 36 variables.

2.5.2.3 Browse content

head(tea) # View(tea) would open the full data in a spreadsheet-like viewer
##       breakfast     tea.time     evening     lunch     dinner     always home
## 1     breakfast Not.tea time Not.evening Not.lunch Not.dinner Not.always home
## 2     breakfast Not.tea time Not.evening Not.lunch Not.dinner Not.always home
## 3 Not.breakfast     tea time     evening Not.lunch     dinner Not.always home
## 4 Not.breakfast Not.tea time Not.evening Not.lunch     dinner Not.always home
## 5     breakfast Not.tea time     evening Not.lunch Not.dinner     always home
## 6 Not.breakfast Not.tea time Not.evening Not.lunch     dinner Not.always home
##       work     tearoom     friends     resto     pub       Tea   How    sugar
## 1 Not.work Not.tearoom Not.friends Not.resto Not.pub     black alone    sugar
## 2 Not.work Not.tearoom Not.friends Not.resto Not.pub     black  milk No.sugar
## 3     work Not.tearoom     friends     resto Not.pub Earl Grey alone No.sugar
## 4 Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone    sugar
## 5 Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone No.sugar
## 6 Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone No.sugar
##       how       where           price age sex          SPC         Sport age_Q
## 1 tea bag chain store       p_unknown  39   M       middle     sportsman 35-44
## 2 tea bag chain store      p_variable  45   F       middle     sportsman 45-59
## 3 tea bag chain store      p_variable  47   F other worker     sportsman 45-59
## 4 tea bag chain store      p_variable  23   M      student Not.sportsman 15-24
## 5 tea bag chain store      p_variable  48   M     employee     sportsman 45-59
## 6 tea bag chain store p_private label  21   M      student     sportsman 15-24
##   frequency     escape.exoticism     spirituality     healthy     diuretic
## 1     1/day Not.escape-exoticism Not.spirituality     healthy Not.diuretic
## 2     1/day     escape-exoticism Not.spirituality     healthy     diuretic
## 3    +2/day Not.escape-exoticism Not.spirituality     healthy     diuretic
## 4     1/day     escape-exoticism     spirituality     healthy Not.diuretic
## 5    +2/day     escape-exoticism     spirituality Not.healthy     diuretic
## 6     1/day Not.escape-exoticism Not.spirituality     healthy Not.diuretic
##       friendliness     iron.absorption     feminine     sophisticated
## 1 Not.friendliness Not.iron absorption Not.feminine Not.sophisticated
## 2 Not.friendliness Not.iron absorption Not.feminine Not.sophisticated
## 3     friendliness Not.iron absorption Not.feminine Not.sophisticated
## 4 Not.friendliness Not.iron absorption Not.feminine     sophisticated
## 5     friendliness Not.iron absorption Not.feminine Not.sophisticated
## 6 Not.friendliness Not.iron absorption Not.feminine Not.sophisticated
##      slimming    exciting    relaxing    effect.on.health
## 1 No.slimming No.exciting No.relaxing No.effect on health
## 2 No.slimming    exciting No.relaxing No.effect on health
## 3 No.slimming No.exciting    relaxing No.effect on health
## 4 No.slimming No.exciting    relaxing No.effect on health
## 5 No.slimming No.exciting    relaxing No.effect on health
## 6 No.slimming No.exciting    relaxing No.effect on health

2.5.3 Convert to factors

#use sapply to find the character columns and lapply to
#convert each of them to a factor
tea[sapply(tea, is.character)] <- lapply(tea[sapply(tea, is.character)],
                                        as.factor)
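The same pattern on a toy data frame, to make explicit what the lines above do: only character columns are converted, other types are left alone.

```r
# Toy demonstration of the character-to-factor conversion above:
toy <- data.frame(colour = c("red", "blue"), n = 1:2,
                  stringsAsFactors = FALSE)
toy[sapply(toy, is.character)] <- lapply(toy[sapply(toy, is.character)],
                                         as.factor)
sapply(toy, class)  # colour -> "factor", n stays "integer"
```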

2.5.4 Visualize after conversion

library(tidyverse)
tea %>% dplyr::select(-age) %>%  #age is continuous, remove it
  pivot_longer(everything()) %>%  # I need long format
  ggplot(aes(x = value))+ # "value" column is created by pivot_longer
  geom_bar(color = "black", fill = "grey")+ #grey bars with a black border
  theme_bw()+ # a cool theme
  facet_wrap(~name, scale ="free")+ #wraps the value by variable "name"
  theme(legend.position = "none")+ # legend not necessary here
  theme(axis.text.x = element_text(size = 12, #x axis text settings
                                   angle = 45,
                                   vjust =1,
                                   hjust =1),
        strip.text = element_text(size = 10, #adjust the strips 
                                  face = "bold",
                                  color = "black"),
        strip.background = element_rect(color = "black", #adjust the background
                                        fill = "lightgrey"),
        axis.title = element_blank(), #remove axis titles
        plot.title = element_text(size =25))+ #make title bigger
  labs(title = "Frequency of each level of the variables") #name it

It can be observed that some variables have levels with very low frequency, which risks distorting the analysis.

2.5.5 variable selection

Variable levels with very low frequency can distort the analysis, so such variables should be removed. Here I set an arbitrary cutoff of 50, meaning any variable with at least one level occurring fewer than 50 times is removed.

#remove age variable since its information has already been covered by age_Q variable
tea <- tea %>% dplyr::select(-age)
vector.tea <- NA  #start a vector (filled with NA)
for (i in 1:35){  #loop over the 35 remaining columns
  table.tea <- table(tea[i]) #save the level frequencies of each variable
  if (min(table.tea)>=50){ #if the smallest frequency is at least 50
    vector.tea[i] <- colnames(tea[i]) #then save its name into vector.tea
  } 
}

keep.col.names <- vector.tea[complete.cases(vector.tea)]
tea <- tea %>% 
  dplyr::select(one_of(keep.col.names)) #drop the variables with rare categories
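The for loop above can also be written without a loop; here is a sketch of the same rare-level filter on a toy data frame (a cutoff of 2 stands in for the 50 used above):

```r
# Loop-free version of the rare-level filter: keep only columns whose
# least frequent level occurs at least `cutoff` times.
cutoff <- 2
toy <- data.frame(f1 = factor(c("a", "a", "b", "b")),  # rarest level: 2 -> keep
                  f2 = factor(c("a", "a", "a", "b")))  # rarest level: 1 -> drop
keep <- vapply(toy, function(col) min(table(col)) >= cutoff, logical(1))
toy.kept <- toy[, keep, drop = FALSE]
names(toy.kept)  # only "f1" survives
```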

2.5.6 Visualize after removal

tea %>%  
  pivot_longer(everything()) %>%  #long format of data make "wrap" possible
  ggplot(aes(x = value))+   #define x axis variable. "Value" is assigned by pivot
  geom_bar(color = "black", fill = "grey")+ #I like grey bars with black border
  theme_bw()+ #a cool theme that I like
  facet_wrap(~name, scale ="free")+ # wrap the value by variable names
  theme(legend.position = "none")+ #legend is not functional here, remove it
  theme(axis.text.x = element_text(size = 12,  
                                   angle = 45,  #adjust angle
                                   vjust =1,  #adjust vertical position
                                   hjust =1), #adjust horizontal position
        strip.text = element_text(size = 10,
                                  face = "bold", # make title more visible
                                  color = "black"),
        strip.background = element_rect(color = "black", #I want the strip consistent
                                        fill = "lightgrey"), # with the bar
        axis.title = element_blank(), # remove axis titles
        plot.title = element_text(size =25))+ #adjust title size
  labs(title = "Frequency of each level of the variables") #name it!

2.5.5 Multiple Correspondence Analysis on the data set

library(FactoMineR) #provides MCA()
library(factoextra) #provides get_eigenvalue() and the fviz_* plots below
mca.tea <- MCA(tea, graph = FALSE) #run the multiple correspondence analysis

2.5.6 Interpret the results

2.5.6.1 Eigenvalues- interpretation with visualization

Eigenvalues measure the proportion of variance retained by each dimension. They are printed below to see how well the first several dimensions (components) represent the full data set.

a <- get_eigenvalue(mca.tea) #obtain the eigenvalue of the mca results
fviz_screeplot(mca.tea,
               barfill = "grey",
               barcolor = "black",
               addlabels = TRUE)+ #screeplot with labels
  labs(title= "Fig. 2.5.6.1 Variance explained by the top 10 dimensions")+
  theme(panel.grid = element_blank())#label

The eigenvalues of the first three dimensions are 0.109, 0.072 and 0.069. Since the eigenvalues of all 22 dimensions add up to 1, an eigenvalue × 100% gives the percentage of the data set’s variance captured by that dimension. Together, the first two dimensions capture 18.83% of the variability, which means a substantial amount of information is lost when a two-dimensional solution is used to represent the variation between variables. In the two-dimensional plots generated in the following sections, the discriminating power of highly differentiated variables may be underestimated, and seemingly indistinct variables may in fact be well differentiated on dimensions beyond the first two. The full results of this MCA should be interpreted with this in mind.
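The percentages behind this reading can be recomputed directly from the eigenvalues; a sketch with the three rounded values quoted above (the 18.83% in the text comes from the unrounded eigenvalues, so the rounded sum lands slightly lower):

```r
# Cumulative variance from (rounded) eigenvalues, assuming they sum to 1
# over all 22 dimensions as stated above:
eig <- c(0.109, 0.072, 0.069)  # first three eigenvalues, rounded
pct <- 100 * eig               # percentage of variance per dimension
cumsum(pct)                    # 10.9, 18.1, 25.0
```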

2.5.6.2 COS2 and visualization

COS2 (squared cosine) measures the quality of the representation of the variables on the factor map (the two-dimensional solution). The higher the cos2, the better; variables with low cos2 should be interpreted with caution. If a variable category is well represented by two dimensions, the sum of its cos2 values over those dimensions is close to one.
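The arithmetic behind cos2 can be checked by hand; the sketch below uses PCA on the built-in USArrests data (the MCA computation is analogous, category by category):

```r
# cos2 of a row: squared coordinate on a dimension divided by the row's
# total squared distance from the origin. Over all dimensions the cos2
# values sum to 1, which is why a sum near 1 over two dimensions means
# good representation in the first factor plane.
pca   <- prcomp(USArrests, scale. = TRUE)
coord <- pca$x["Texas", ]           # coordinates of one row on all 4 PCs
cos2  <- coord^2 / sum(coord^2)
sum(cos2)                           # exactly 1 by construction
sum(cos2[1:2])                      # representation in the first plane
```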

The cos2 values for the first two dimensions are printed below and summed for each variable category, to explore how well the categories in the data set are represented by the first two dimensions.

#extract cos2 values for the first 2 dimensions
cos2.dim <- mca.tea$var$cos2[, 1:2] %>% data.frame() #save as data frame
#save row names as a variable, so that ordering by column is possible
cos2.dim$variable <- rownames(cos2.dim)
#delete the row names since we already saved them into a new variable
rownames(cos2.dim) <- NULL
#calculate the addition of each row
cos2.dim <- cos2.dim %>% mutate(Dim.both = Dim.1 + Dim.2)
#show the top 10 largest results by descending order 
cos2.dim[order(-cos2.dim$Dim.both), c(3,4)][1:10,]
##        variable  Dim.both
## 22            M 0.4236926
## 21            F 0.4236926
## 36 Not.feminine 0.4173388
## 35     feminine 0.4173388
## 6   Not.evening 0.3458466
## 5       evening 0.3458466
## 12      tearoom 0.3117687
## 11  Not.tearoom 0.3117687
## 14  Not.friends 0.3097810
## 13      friends 0.3097810

As discussed above, the closer the summed cos2 of a variable category is to one, the better it is represented by the two dimensions. Reviewing the results, no summed cos2 value exceeds 0.5, and only four categories exceed 0.4. This indicates that only a limited number of categories are even roughly represented by the first two dimensions, and that the strength of the variable relationships could be heavily underestimated by the plots generated in the next sections.

For a better understanding, a variable plot of the MCA categories with high cos2 is generated (Fig. 2.5.6.2). The results correspond to the numeric cos2 findings. Variable categories such as tearoom and feminine are relatively well represented by both dimensions, since they lie far from the origin along both the x and y axes. Categories such as evening are well represented by one dimension but not the other, since they lie far from the origin along only one axis (for evening, the y axis). Categories such as friends are not well represented by either dimension, since they lie close to the origin in both directions.

p1 <- fviz_mca_var(mca.tea, 
                   col.var = "cos2", 
                   repel = TRUE, # avoid overlap
                   ggtheme = theme_test(), #theme that I like
                   gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
                   select.var = list(cos2 = 10))+ # limited the number of visualized
  labs(title = "Fig 2.5.6.2 Quality of representation for variables \n(top 10 highly represented variables)")
                                            #variables to 10 (largest cos2)
                                         
p1

2.5.6.3 Contribution and visualization

Contr measures the percentage contribution of each variable category to the definition (inertia) of a dimension; the contributions to each dimension sum to 100%. A larger contr means the category captures more of that dimension’s inertia. Contr plays a role similar to factor loadings in principal component analysis, so by reviewing the categories that contribute strongly to a dimension it is possible to infer an underlying construct. Unfortunately, since I could not find a clear description of the variable categories in the tea data set, that is not possible here. The contr values for the first two dimensions are displayed below.
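The “sum to 100%” property can be verified by hand in the PCA case (MCA contributions behave the same way, category by category):

```r
# For prcomp, each dimension's loading vector has unit length, so
# 100 * loading^2 gives percentage contributions that sum to 100
# within every dimension:
pca     <- prcomp(USArrests, scale. = TRUE)
contrib <- 100 * pca$rotation^2  # contribution of each variable (rows)
colSums(contrib)                 # 100 for every dimension
```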

#extract contr values for the 1st 2 dimensions
tea.contr <- mca.tea$var$contrib[,1:2] %>% data.frame()
#save row names to a new variable
tea.contr$variable <- rownames(tea.contr)
#delete old row names.
rownames(tea.contr) <- NULL
#order by Dim.1, showing the top 10
tea.contr[order(-tea.contr$Dim.1), c(1,3)][1:10,]
##        Dim.1         variable
## 12 10.062284          tearoom
## 34 10.054077 Not.friendliness
## 22  8.120778                M
## 3   6.347914     Not.tea time
## 14  5.666634      Not.friends
## 21  5.565927                F
## 4   4.920572         tea time
## 18  4.868037              pub
## 35  4.816059         feminine
## 16  4.005020            resto

The variable categories tearoom and Not.friendliness each contribute more than 10% of the inertia of dimension one. Reviewing the important contributors, this dimension appears to capture the social side of tea drinking. However, since the original survey questions are not available, no explicit conclusion can be drawn.

#order by Dim.2 showing the top 10
tea.contr[order(-tea.contr$Dim.2), c(2,3)][1:10,]
##        Dim.2          variable
## 5  11.480348           evening
## 35  7.025548          feminine
## 16  6.576528             resto
## 6   6.002415       Not.evening
## 37  5.604762 Not.sophisticated
## 36  5.299975      Not.feminine
## 41  4.991827       No.relaxing
## 23  4.755134     Not.sportsman
## 32  3.887125      Not.diuretic
## 14  3.793022       Not.friends

The variable category evening contributes more than 10% of the inertia of dimension two. Reviewing the important contributors, this dimension appears to capture the leisure side of tea drinking. Again, since the original survey questions are not available, no explicit conclusion can be drawn.

Below is a plot of the contributions of the top 15 variable categories to dimension one (Fig 2.5.6.3a). It conveys the same insights as the numeric results, except that more categories are included.

#bar plot of the top 15 most contributed variable levels
fviz_contrib(mca.tea, 
             fill = "grey", 
             color = "black", 
             choice = "var", 
             axes = 1, 
             top = 15)+
  labs(title = "Fig 2.5.6.3 a :top 15 most contributed variable levels to dimension one")+
  theme(panel.grid = element_blank())

Below is the corresponding plot for dimension two (Fig 2.5.6.3b). It carries the same information as the numeric results, except that more categories are included.

#bar plot of the top 15 most contributed variable levels
fviz_contrib(mca.tea,
             fill = "grey", 
             color = "black", 
             choice = "var", 
             axes = 2, 
             top = 15)+
  labs(title = "Fig 2.5.6.3 b :top 15 most contributed variable levels to dimension two")+
  theme(panel.grid = element_blank())

Below is a plot of the contributions of the top 10 variable categories to dimensions one and two together (Fig 2.5.6.3c). It carries the same information as the numeric results. Note that categories contributing highly to dimension one appear far out along the x axis, while those contributing highly to dimension two appear far out along the y axis. The level of contribution is also reflected by a colour gradient based on the sum of the contr values for both dimensions. Note that the category evening contributes highly to dimension two yet is coloured as only a moderate contributor, because its low contribution to dimension one drags its combined importance down.

#visualize the variable plot (row names are kept out of the plot)
p2 <- fviz_mca_var(mca.tea, col.var = "contrib",  
            repel = TRUE,  #avoid overlap
            ggtheme = theme_test(), #theme that I like
            gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             select.var = list(contrib = 10))+# variables with top 10 largest 
  labs(title = "Fig 2.5.6.3 c: Contribution of variables to first two dimensions \n(10 most important contributing variables) " )
                                              #contribution  are shown
p2

Below, plots 2.5.6.2 and 2.5.6.3c are displayed side by side so that cos2 (how well the dimensions represent a category) and contr (how much a category contributes to the dimensions) can be compared. Most variable categories, such as evening, tearoom, M and Not.friends, show consistent strength in cos2 and contr. Some, such as friends, show moderate strength in cos2 but not in contr, suggesting they may be well represented by the dimensions while their contribution is either explained by other categories or spread across the dimensions with neither one pronounced.

library(patchwork)
p1+p2

2.5.6.4 correlation between variables and principal dimensions

The correlation between variables and first two dimensions are visualized here.

fviz_mca_var(mca.tea, choice = "mca.cor", 
            repel = TRUE, 
            ggtheme = theme_minimal())+
  theme(panel.grid = element_blank(),
          panel.background  = element_rect(color = "black"))

No variable is highly correlated with either dimension. Variables such as sex, tearoom, friendliness and tea.time are moderately correlated with dimension one; evening is moderately correlated with dimension two; and feminine and resto are moderately correlated with both. Variables clustered close to the origin are not noticeably correlated with either dimension.
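For reference, the quantity behind this plot is, to my understanding (an assumption about FactoMineR's internals), the squared correlation ratio (eta²) between each categorical variable and each dimension's row coordinates. A by-hand sketch with toy values:

```r
# eta^2: share of a dimension's variance explained by a variable's
# categories (toy dimension coordinates and a toy two-level variable):
score <- c(1, 2, 3, 10, 11, 12)              # pretend row coordinates
grp   <- factor(rep(c("a", "b"), each = 3))  # pretend categories
ss_total   <- sum((score - mean(score))^2)
grp_means  <- tapply(score, grp, mean)
ss_between <- sum(table(grp) * (grp_means - mean(score))^2)
eta2 <- ss_between / ss_total
eta2  # close to 1: the categories separate strongly on this dimension
```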

2.5.6.5 classic variable plot

The classic variable plot is essentially the cos2 plot. However, I visualize it again without excluding the poorly represented categories, and try to give a more thorough interpretation.

fviz_mca_var(mca.tea, 
             repel = TRUE,
             ggtheme = theme_test(),
             title = "Fig 2.5.6.5 Classic variable plot for the first two dimensions")# theme that I like

Before interpreting the plot, I emphasize again that the first two dimensions explain only about 19% of the total inertia in the data, so not all points are equally well displayed in this plane. Categories that do not separate well in the plot may still be distinct on dimensions beyond the first two.

Variable categories clustered close to the origin, such as healthy, Not.breakfast, Not.pub, sportsman and Not.spirituality, are likely to have weak or no association. Points further from the origin, such as evening, resto, pub, tearoom, Not.friends, No.relaxing, feminine, Not.sportsman and Not.friendliness, are quite discriminating. Among them, tearoom and pub are well represented by the first dimension and are strongly positively correlated. Not.friendliness is also well represented by dimension one, but it is negatively correlated with tearoom and pub, since it lies in the opposite quadrant.

evening and resto are well represented by the second dimension and are strongly positively correlated. No.relaxing, Not.sportsman and feminine are also well represented by dimension two, but they are negatively correlated with evening and resto, since they lie in the opposite quadrant.